tooling(scripts): add per-template sweep classifiers (#187/#190/#192/#193) by hyperpolymath · Pull Request #194 · hyperpolymath/standards

hyperpolymath · 2026-05-26T11:19:50Z

Summary

Durable tooling for the wrapper-sweep work that follows each of the four foundational reusable PRs filed today (#187 mirror, #190 secret-scanner, #192 codeql, #193 hypatia-scan).

Adds scripts/sweep-classifiers/:

classify-mirror.sh — for feat(governance): add mirror-reusable.yml — consolidate 289-repo mirror.yml drift #187
classify-secret-scanner.sh — for feat(governance): add secret-scanner-reusable.yml — propagate shell-secrets to 281 repos #190
classify-codeql.sh — for feat(governance): add codeql-reusable.yml — consolidate 263-repo codeql.yml drift #192
classify-hypatia-scan.sh — for feat(governance): add hypatia-scan-reusable.yml — biggest LOC leverage of the reusable trilogy #193
README.adoc — usage + nested-path caveat

What each classifier does

Reads a paginated gh api /search/code JSON dump for the template
Fetches each unique blob SHA exactly once (cached in $BLOBS_DIR)
Classifies each blob (job-set match, line-count band, language matrix)
Emits per-repo TSV: <repo>\t<sha>\t<class>\t<reason>\t<lines>\t<details>
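
The steps above can be sketched roughly as follows. This is a minimal illustration, not the committed scripts: the function names (`fetch_blob`, `cached_blob`, `classify_blob`, `classify_all`) are hypothetical, the fetcher is a stub so the caching logic can run offline, and the single line-count check stands in for the real job-set, line-count-band, and language-matrix rules.

```shell
#!/usr/bin/env bash
# Sketch only: names and the line-count rule are illustrative assumptions.
set -euo pipefail

BLOBS_DIR="${BLOBS_DIR:-$(mktemp -d)}"

# Hypothetical fetcher stub; a real script would download the blob here.
fetch_blob() {  # args: repo sha
  printf 'name: example\n'
}

# Fetch a blob only if it is not already cached under $BLOBS_DIR.
cached_blob() {  # args: repo sha
  local repo="$1" sha="$2" file="$BLOBS_DIR/$2"
  [[ -f "$file" ]] || fetch_blob "$repo" "$sha" > "$file"
  printf '%s\n' "$file"
}

# Assumed rule: a blob whose line count equals the canonical count is TRIVIAL.
classify_blob() {  # args: file canonical_lines
  local lines
  lines=$(wc -l < "$1" | tr -d ' ')
  if (( lines == $2 )); then
    printf 'TRIVIAL\tline-count-match\t%s\n' "$lines"
  else
    printf 'NEEDS_REVIEW\tline-count-mismatch\t%s\n' "$lines"
  fi
}

# stdin:  <repo>\t<sha>
# stdout: <repo>\t<sha>\t<class>\t<reason>\t<lines>\t<details>
classify_all() {  # args: canonical_lines
  local repo sha file
  while IFS=$'\t' read -r repo sha; do
    file=$(cached_blob "$repo" "$sha")
    printf '%s\t%s\t%s\t-\n' "$repo" "$sha" "$(classify_blob "$file" "$1")"
  done
}
```

Because the cache is keyed by blob SHA alone, two repos carrying an identical workflow file trigger a single fetch, which is what keeps a sweep over several hundred repos cheap.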

Numbers produced across the four campaign templates

Template	TRIVIAL / mechanical	NEEDS_REVIEW	Notable
mirror.yml	267/289 (92.4%)	22	16 slim 2-3 forge variants
secret-scanner	273/281 (97.2%) MISSING_SHELL_SECRETS	3	Only `standards` repo carries `shell-secrets` today
codeql	246/263 (93.5%)	17	11 custom 99-114-line workflows
hypatia-scan	249/255 (97.6%)	6	Pure propagation lag, no real customisation
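
The percentages in the table can be rechecked from the raw counts. A small sketch, assuming `awk` is available; the helper name `pct` is hypothetical:

```shell
# pct NUM DEN: print NUM/DEN as a percentage with one decimal place.
pct() { awk -v n="$1" -v d="$2" 'BEGIN { printf "%.1f\n", 100 * n / d }'; }

pct 267 289   # mirror.yml
pct 273 281   # secret-scanner
pct 246 263   # codeql
pct 249 255   # hypatia-scan
```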

Nested-path caveat (documented in README)

gh api /search/code with path:.github/workflows matches the path
PREFIX — monorepo nested workflow files (e.g.,
a2ml/bindings/deno/.github/workflows/hypatia-scan.yml) are EXCLUDED.
Verified for hypatia-scan: broader query without path: returns 704
results vs 255 path-filtered. The same effect likely applies to the
other three templates; sweep tooling must walk all
**/.github/workflows/<template>.yml paths.
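
One way to tell the two cases apart once a full path list is available is a plain pattern match on the path. This is a sketch under that assumption; the function name `workflow_scope` and the `other` label are hypothetical and not taken from the committed scripts.

```shell
# Label a workflow path as top-level, nested, or other.
workflow_scope() {  # args: path template-filename
  case "$1" in
    ".github/workflows/$2")   echo top-level ;;  # root of the repo
    *"/.github/workflows/$2") echo nested ;;     # under some subdirectory
    *)                        echo other ;;
  esac
}

workflow_scope ".github/workflows/hypatia-scan.yml" hypatia-scan.yml
workflow_scope "a2ml/bindings/deno/.github/workflows/hypatia-scan.yml" hypatia-scan.yml
```

The first call prints `top-level` and the second prints `nested`; only the first kind is matched by a `path:.github/workflows` search.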

Pattern

Same shape as scripts/apply-baseline.sh (paired with
scripts/tests/apply-baseline-test.sh) — committed durable tooling
rather than ephemeral /tmp scripts.

🤖 Generated with Claude Code

…-workflow campaign

Durable tooling for the wrapper-sweep work that follows each of the foundational reusable PRs (#187 mirror, #190 secret-scanner, #192 codeql, #193 hypatia-scan).

Each classifier:
- reads a paginated `gh api /search/code` JSON dump
- fetches each unique blob SHA exactly once (cached in $BLOBS_DIR)
- emits per-repo TSV: <repo>\t<sha>\t<class>\t<reason>\t<lines>\t<details>

Classes vary per template but follow the same shape:
- TRIVIAL: canonical match, mechanical wrapper
- SLIM/MISSING/OLDER: propagation lag, auto-upgrades on first run after wrapper merge
- NEEDS_REVIEW: custom workflow body, requires per-repo diff

Numbers produced by these classifiers across the four campaign templates:
- mirror.yml: 267/289 TRIVIAL (92.4%); 22 NEEDS_REVIEW
- secret-scanner: 273/281 missing shell-secrets (97.2%); 1 TRIVIAL (standards itself)
- codeql: 246/263 mechanical (93.5%); 17 NEEDS_REVIEW
- hypatia-scan: 249/255 safe-to-standardize-up (97.6%); 6 NEEDS_REVIEW

README documents the path-filter caveat: `gh api /search/code` with `path:.github/workflows` excludes monorepo-nested workflow files; the broader `filename:` query (no path filter) catches them. For hypatia-scan, the broader query returns 704 vs the 255 path-filtered count, so the ~449 nested copies also need wrappers when sweeps fire.

…ergence set (#205)

## Summary

5th and final reusable in the workflow convergence campaign (see #199 for the meta-doc). Consolidates the per-repo `scorecard.yml` workflow.

## Drift signal (full pagination + per-repo verified)

- **258** top-level estate deployments
- **626** nested copies in monorepos (asdf-tool-plugins, developer-ecosystem, ssg-collection, standards, ambientops, julia-ecosystem, etc.; Layer-2 truncation discovery via #204's helper)
- **46** unique blob SHAs / 17.8% structural drift
- Top SHA covers **100/258 (38.8%)**, the highest dominant-cluster of the 5 campaigns
- Top 7 SHAs cover ~80%
- **100% mechanical drift, ZERO feature variance**: SPDX header (PMPL-1.0 / PMPL-1.0-or-later / MPL-2.0), `upload-sarif` SHA-pin churn, `permissions: read-all` vs `contents: read` wording

## Design

- One input: `runs-on` (default ubuntu-latest)
- No `secrets: inherit`; Scorecard uses `GITHUB_TOKEN` directly
- Caller MUST grant `security-events: write` + `id-token: write` on the calling job (called-workflow permissions are capped by caller)
- Caller keeps own `on:` triggers + `concurrency:` group

## Per Layer-3 caveat from the campaign meta-doc

Nested workflows are inert: GitHub Actions only runs `.github/workflows/` at the repo root. Sweeping the 626 nested copies is single-source-of-truth cleanup, not security hardening.

## Campaign convergence set (closes with this PR)

| PR | Template |
|---|---|
| #187 | mirror-reusable.yml |
| #190 | secret-scanner-reusable.yml |
| #192 | codeql-reusable.yml |
| #193 | hypatia-scan-reusable.yml |
| #194 | sweep-classifier scripts |
| #199 | campaign meta-doc |
| #204 | list-workflow-paths.sh (bypass /search/code undercount) |
| **this** | **scorecard-reusable.yml** |

## Test plan

- [ ] Wrapper sweep (~258 top-level + ~626 nested): owner-gated; not part of this PR
- [ ] Update classify-* scripts to consume helper TSV (follow-up)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

…consumers (#204)

## Summary

Two-commit change adding nested-path support to the sweep-classifier pipeline:

1. **`scripts/sweep-classifiers/list-workflow-paths.sh`**: walks `gh repo list` and queries each repo's Git Tree API directly. Bypasses two compounding undercounts in `gh api /search/code`.
2. **All 4 `classify-*.sh` scripts updated** to consume the helper's TSV output and emit the sweep-target path as an explicit column.

## Why the helper exists: 3 layers of undercount

1. **Layer 1, path-prefix filter:** `path:.github/workflows` matches the path PREFIX, excluding nested `<subdir>/.github/workflows/<file>.yml` paths outright.
2. **Layer 2, org-scope truncation:** even broad `filename:<file>.yml org:<org>` queries hit internal caps. Validated against `scorecard.yml`: broad query saw 152 paths (all flagged top-level); per-repo enumeration found **626 additional nested copies** the broad query missed entirely.
3. **Layer 3, nested workflows are inert:** GitHub Actions only runs `.github/workflows/` at the repo root. Nested copies are vendored templates / stale leftover. Security campaigns gain nothing from sweeping nested copies; single-source-of-truth campaigns still benefit.

## Helper output

TSV, one row per matching workflow file:

```
<repo>\t<path>\t<blob-sha>\t<top-level|nested>
```

Cost: one Git Tree API call per repo (~300 calls), uses `core` bucket (5000/hr) not throttled `code_search` (10/min).

## Classifier extensions

Each `classify-*.sh` now auto-detects input format from the first byte:

- `{` → JSONL from `gh /search/code` (legacy path)
- otherwise → TSV from `list-workflow-paths.sh` (preferred, handles nested)

Output is unified to 7 columns: `repo \t path \t sha \t class \t reason \t lines \t details`. The new `path` column carries the file's location inside the repo, so sweeps can target nested copies as first-class wrapper sites.

Shared `normalize_input` extracted into `_lib.sh`; each classifier sources it.

## Validation

Smoke-tested both input paths:

- TSV (helper): classify-mirror.sh on scorecard-tuples.tsv (287 repos × top-level + nested); fetches blobs and emits per-(repo, path) rows.
- JSONL (legacy): classify-mirror.sh on mirror-full.json; 267 TRIVIAL + 22 NEEDS_REVIEW, matching prior `/tmp/drift-survey/sweep-report.md`.

## Stacked on #194

`scripts/sweep-classifiers/` only exists once #194 merges. The diff against `main` includes #194's files transitively; once #194 lands, this PR narrows to just the helper + extensions.

## Standing follow-ups

- Once this lands, re-survey each candidate with the helper for ground-truth wrapper-site counts before firing any sweep.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
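
The first-byte format detection described under "Classifier extensions" could look roughly like this. It is a sketch only: the function name `detect_format` is hypothetical, and the actual `normalize_input` in `_lib.sh` is not shown in this PR text, so it may differ.

```shell
# Decide the input format from the first byte of the file:
# "{" means JSONL from the code-search dump, anything else is treated as helper TSV.
detect_format() {  # args: input-file
  local first
  first=$(head -c 1 "$1")
  if [[ "$first" == "{" ]]; then
    echo jsonl
  else
    echo tsv
  fi
}
```

Checking a single byte keeps the classifiers free of extra flags: the same invocation accepts either the legacy dump or the helper output.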

hyperpolymath enabled auto-merge (squash) May 26, 2026 11:22

This was referenced May 26, 2026

docs(audits): workflow convergence campaign 2026-05-26 #199

Open

tooling(scripts): nested-path support — Git Tree helper + classifier consumers #204

Merged

feat(governance): add scorecard-reusable.yml — close 5-candidate convergence set #205

Merged

ci: kick — initial PR push appears not to have triggered Actions

7e83955

Same as #192 (codeql-reusable) — auto-merge enabled but zero workflow runs against the head commit. Pushing empty commit to re-trigger CI.

hyperpolymath wants to merge 2 commits into main from feat/sweep-classifiers
